
feat(pkg/p2p): reconnect on disconnected peers#3212

Open
julienrbrt wants to merge 10 commits into main from julien/p2p

Conversation


@julienrbrt julienrbrt commented Mar 30, 2026

Attempt to reconnect to peers when experiencing connectivity issues.
PR made by AI. Still testing.

Summary by CodeRabbit

  • Bug Fixes
    • Improved P2P resilience: automatic reconnection for transient network failures, periodic peer discovery refreshes, and stronger connection maintenance.
  • Documentation
    • Added an unreleased changelog entry noting the P2P connection improvements.
  • Chores
    • Re-enabled local module mappings to support building and testing against local source trees.
  • Refactor
    • Simplified retry loop implementation in cloud signers without changing retry behavior.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds periodic DHT-based peer discovery and a background reconnect/maintenance loop to the P2P client, re-enables local Go module replace directives in two apps, and adjusts retry-loop syntax in AWS and GCP KMS signer implementations.

Changes

  • Changelog (CHANGELOG.md): Added an unreleased entry, "Improve P2P connection in case of transient network failure", referencing PR #3212.
  • Go module replaces (apps/evm/go.mod, apps/testapp/go.mod): Uncommented local replace directives so github.com/evstack/ev-node (and its subpath) resolve to local ../../ paths for development builds.
  • P2P connection maintenance (pkg/p2p/client.go): Added a connection semaphore, seedPeers tracking, a disconnect notifee, and a maintenance goroutine with periodic DHT re-advertise/FindPeers, concurrent connect attempts, and graceful shutdown.
  • KMS signer loops (pkg/signer/aws/signer.go, pkg/signer/gcp/signer.go): Changed the retry loop from the indexed form for attempt := 0; attempt < maxAttempts; attempt++ to the range-based for attempt := range maxAttempts, preserving backoff and retry behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Client as P2P Client
    participant Host as libp2p Host
    participant DHT as DHT/Discovery
    participant Notifee as Disconnect Notifee
    participant Maintenance as Maintenance Loop

    Client->>Host: startWithHost()
    Client->>Notifee: register disconnect handler
    Client->>Maintenance: start maintenance goroutine

    Note over Host,Notifee: network event
    Host-->>Notifee: peer disconnected
    Notifee->>Maintenance: notify reconnectCh

    Maintenance->>Maintenance: apply per-peer backoff
    Maintenance->>Host: tryConnect(peer) (limited by connectSem)
    Host-->>Maintenance: connection result

    Maintenance->>DHT: periodic refreshPeerDiscovery()
    DHT->>Host: re-advertise namespace
    DHT-->>Maintenance: discovered peers
    Maintenance->>Host: tryConnect(discovered peers)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • randygrok
  • tac0turtle

Poem

🐰 I hop through nets both near and far,
I sniff the trails of every star.
When peers go missing in the night,
I nudge them back with patient might.
Reconnect! — the rabbit cheers outright.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check: ⚠️ Warning. The description is minimal and vague, missing the required Overview section with context, background, goal, rationale, and proper issue linking as specified in the template. Resolution: expand the description with an Overview section explaining the problem being solved and the approach taken, link the related issue (#3212), and remove the note about AI generation and testing status.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately describes the main change: adding reconnection logic for disconnected peers in the P2P package, which is the primary purpose of the pkg/p2p/client.go modifications.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Mar 30, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

  • Build: ✅ passed
  • Format: ⏩ skipped
  • Lint: ✅ passed
  • Breaking: ✅ passed
  • Updated (UTC): Apr 1, 2026, 11:22 AM

@claude
Contributor

claude bot commented Mar 30, 2026

Claude finished @julienrbrt's task in 1m 32s


Code Review

  • Gather context and read changed files
  • Review pkg/p2p/client.go (main change)
  • Review apps/evm/go.mod and apps/testapp/go.mod
  • Review KMS signer changes
  • Post final review

Summary

The PR adds a background maintenance loop for periodic DHT re-discovery and a disconnectNotifee to track seed peer connections. The architecture is straightforward, but there are several correctness issues that need to be addressed before this is production-ready.


Critical Issues

1. Goroutine leak in refreshPeerDiscovery (pkg/p2p/client.go:335)

c.advertise(ctx) calls discutil.Advertise(ctx, ...) which, per libp2p's implementation, starts a background re-advertisement goroutine that loops until ctx.Done(). Since refreshPeerDiscovery is called every 5 minutes via the ticker, a new long-lived goroutine is spawned every tick, accumulating indefinitely over the node's lifetime.

Fix: Call c.advertise(ctx) once when startConnectionMaintenance creates its context, not on every tick. The periodic ticker should only call FindPeers, not re-advertise.

func (c *Client) startConnectionMaintenance() {
    ctx, cancel := context.WithCancel(context.Background())
    c.maintenanceCancel = cancel

    c.maintenanceWg.Go(func() {
        c.advertise(ctx) // once, tied to maintenance lifetime

        discoveryTicker := time.NewTicker(peerDiscoveryInterval)
        defer discoveryTicker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-discoveryTicker.C:
                c.refreshPeerDiscovery(ctx) // no advertise() call inside
            }
        }
    })
}


2. disconnectNotifee never triggers reconnection (pkg/p2p/client.go:291-298)

The Disconnected callback only logs when a seed peer disconnects. There is no reconnect signal sent. The maintenance loop has no mechanism to respond to disconnect events — it only runs on a 5-minute timer. A disconnected seed peer won't be retried until the next ticker fires (up to 5 minutes later).

If the intent was to reconnect reactively on disconnect, a channel signal is needed. If reconnection on disconnect was intentionally dropped (the "simplify" commit removed prior reconnect channel logic), the disconnectNotifee type still exists and registers itself via Notify() but does nothing meaningful beyond logging. This should either be completed or the Notify call removed.


3. Untracked dial goroutines race against Close() (pkg/p2p/client.go:349-352)

The goroutines spawned in refreshPeerDiscovery to dial peers are not tracked by maintenanceWg:

go func(peer peer.AddrInfo) {
    defer func() { <-c.connectSem }()
    c.tryConnect(ctx, peer)  // uses maintenance ctx, but not in maintenanceWg
}(p)

Close() cancels the context and then calls maintenanceWg.Wait(), but that only waits for the ticker goroutine, not these in-flight dials. The host can be closed (c.host.Close()) while dials are still in progress, causing use-after-free panics or confusing errors.

Fix: Use c.maintenanceWg.Go(...) instead of go func(...):

peerInfo := p
c.maintenanceWg.Go(func() {
    defer func() { <-c.connectSem }()
    c.tryConnect(ctx, peerInfo)
})



Moderate Issues

4. findPeers bypasses the connection semaphore (pkg/p2p/client.go:459-463)

findPeers (called during initial startup) spawns unbounded goroutines:

for peer := range peerCh {
    go c.tryConnect(ctx, peer)
}

But refreshPeerDiscovery (the maintenance path) correctly uses connectSem to limit concurrency. This inconsistency means startup can fan out to 60+ simultaneous dial attempts. Since connectSem is initialized before peerDiscovery() is called, there's no reason not to use it in findPeers.

5. started field is not protected by a mutex (pkg/p2p/client.go:65)

c.started is read in Start() and written in startWithHost() and Close() without any synchronization. If Start and Close are ever called concurrently (possible in test teardown scenarios), this is a data race. At minimum, document that Start/Close are not goroutine-safe, or protect with a mutex.

6. GetPeers blocks indefinitely with context.Background() (pkg/p2p/client.go:519)

peerCh, err := c.disc.FindPeers(context.Background(), c.getNamespace(), ...)

This has no timeout and no cancellation. If the DHT query hangs, this method never returns. It should accept a context from the caller or use one with a reasonable deadline.


Minor Issues

7. advertise() returns error but always returns nil

The function signature is misleading — discutil.Advertise returns nothing (it's void), so advertise() can never return an error. The return type should be removed and callers simplified. Also, _ = c.advertise(ctx) discards the return value explicitly, which is a code smell when the function always returns nil.

8. Committed replace directives in apps/evm/go.mod and apps/testapp/go.mod

The local replace directives (../../) work in the monorepo checkout but break any consumer that downloads these modules as versioned packages from a module proxy. For CI or external builds that don't have the full monorepo layout, this will fail to build. If these apps are strictly internal/monorepo-only and not intended to be published, that should be documented. Otherwise, use a go.work file (not committed, or committed separately as a dev-only artifact) for local development overrides.
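For local development, a dev-only go.work at the repository root could provide the same overrides without touching the committed go.mod files (a sketch; the module layout is assumed from this PR and the Go version is illustrative):

```
go 1.24

use (
	.
	./apps/evm
	./apps/testapp
)
```

Keeping this file out of version control (or documented as a dev-only artifact) lets the apps build against local source trees while the published go.mod files remain proxy-friendly.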


KMS Signer Changes

The for attempt := range maxAttempts refactor in pkg/signer/aws/signer.go and pkg/signer/gcp/signer.go is clean and idiomatic for Go 1.22+. No issues.


Test Coverage

Codecov reports 39.47% patch coverage with 23 lines missed in pkg/p2p/client.go. The new maintenance and reconnect paths have minimal test coverage. Given that this is resilience-critical code, adding at least one integration test that simulates a peer disconnect and verifies recovery would significantly improve confidence.


Overall: The approach is sound but the goroutine leak (issue #1) and the untracked dial goroutines (issue #3) are correctness bugs that should be fixed before merging. The non-functional disconnect notifee (issue #2) either needs to be wired up or removed to avoid confusion.

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.37%. Comparing base (4a70e0b) to head (cd87d23).
⚠️ Report is 1 commit behind head on main.

Files with missing lines:
  • pkg/p2p/client.go: 48.27% patch coverage (29 lines missing, 1 partial) ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3212      +/-   ##
==========================================
- Coverage   61.43%   61.37%   -0.07%     
==========================================
  Files         120      120              
  Lines       12504    12562      +58     
==========================================
+ Hits         7682     7710      +28     
- Misses       3960     3989      +29     
- Partials      862      863       +1     
Flag Coverage Δ
combined 61.37% <50.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@julienrbrt julienrbrt marked this pull request as ready for review March 31, 2026 16:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
apps/testapp/go.mod (1)

5-5: Keep this replace local to development.

This ties the module to ../../. being present at build time. That works inside this checkout, but not when the submodule is built outside the monorepo layout. Prefer a local go.work or an uncommitted dev override instead. Please verify this module is not expected to build outside the repository layout; if apps/testapp is intentionally monorepo-only, feel free to ignore this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/testapp/go.mod` at line 5, The committed go.mod contains a local replace
directive ("replace github.com/evstack/ev-node => ../../.") that forces a
monorepo layout; remove this replace from the tracked go.mod and instead keep it
as an uncommitted developer-only override (or move the replacement into a
go.work used only in development), or if this module is intended to be
monorepo-only, document that restriction; update the repository by deleting the
replace line from the committed go.mod and adding a dev-only go.work or
.gitignore'd override as appropriate so builds outside the repo layout are not
broken.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/evm/go.mod`:
- Around line 5-8: The go.mod contains committed relative replace directives for
modules "github.com/evstack/ev-node" and
"github.com/evstack/ev-node/execution/evm" (the two replace (...) entries) which
will break consumers outside the monorepo; remove these relative replace lines
from apps/evm/go.mod before committing and, for local development only, move
them into a go.work file or use developer instructions to run `go work use` so
local overrides remain available but are not shipped in the module's go.mod.

In `@pkg/p2p/client.go`:
- Around line 343-355: The reconnect loop currently only receives signals from
reconnectCh (fed by Disconnected) so a failed host.Connect leaves the peer
without future triggers, and the current logic drops events that arrive during
the backoff window; update the reconnect handling (references: reconnectCh,
Disconnected, host.Connect, st.attempts, st.lastAttempt, reconnectCooldown,
maxReconnectCooldown) so that on a connect failure you re-schedule another retry
(e.g., push a retry event into reconnectCh or start a time.AfterFunc/goroutine
to re-enqueue after computed backoff) and when an event arrives inside the
backoff window do not drop it but schedule it to be processed after the
remaining backoff interval (use a timer to requeue into reconnectCh); apply the
same scheduling change to the other similar block (lines around the second
instance handling backoff) so transient outages cause repeated retries rather
than a single attempt.
- Line 199: The maintenance goroutine is started with the caller-supplied ctx so
timeouts/cancels on Start() prematurely stop reconnection/discovery; change this
so maintenance uses its own background context owned by the client and only
stops when Close() is called. Create a maintenance context via
context.WithCancel(context.Background()) (store the cancel func on the client
struct), pass that maintenance context into startConnectionMaintenance (and the
other maintenance/startDiscovery loops currently using the Start() ctx around
lines 317-320), and invoke the stored cancel in Close(); do not use the Start()
ctx for these long-lived loops.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c692055-7514-4e34-ad88-0d6d76c89f7d

📥 Commits

Reviewing files that changed from the base of the PR and between 4a70e0b and 32dcf35.

⛔ Files ignored due to path filters (2)
  • apps/evm/go.sum is excluded by !**/*.sum
  • apps/testapp/go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
  • CHANGELOG.md
  • apps/evm/go.mod
  • apps/testapp/go.mod
  • pkg/p2p/client.go
  • pkg/signer/aws/signer.go
  • pkg/signer/gcp/signer.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
pkg/p2p/client.go (1)

351-352: ⚠️ Potential issue | 🟠 Major

Backoff still turns a transient seed outage into a single retry.

At Line 351 the event is dropped if the peer is still inside its backoff window, and at Line 367 a failed host.Connect only logs. After that first failed dial there is no future trigger for that seed peer, so it can remain disconnected indefinitely.

Also applies to: 367-369

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 351 - 352, The code currently skips
processing a seed peer if time.Now().Before(st.lastAttempt.Add(backoff)) and
only logs on host.Connect failure, which means a transient outage may never be
retried; change this so that when a peer is within backoff you requeue/schedule
it for a future attempt instead of dropping the event (use a timer or push it
back onto the attempt queue and keep state in st), and when host.Connect(peer)
fails update st.lastAttempt and also schedule a follow-up retry (rather than
only logging) so the peer will be retried after backoff; make these changes
around the references to st.lastAttempt.Add(backoff) and the host.Connect call
to ensure failed dials trigger future retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/p2p/client.go`:
- Around line 317-321: startConnectionMaintenance currently only handles future
Disconnected events so seed peers that are already disconnected never get
enqueued; before the maintenance select loop begins (inside the maintenanceWg.Go
started by startConnectionMaintenance), iterate c.seedPeers and for each peer
that is not currently connected (use existing connection state check used
elsewhere in the file), send that peer into c.reconnectCh once so they will be
retried; ensure sends won't block the worker (use a non-blocking select with
default or spawn a short goroutine per send) and rely on
refreshPeerDiscovery/c.getNamespace as before for discovery updates.
- Around line 388-389: The current loop repeatedly calls c.advertise(ctx) which
spawns a background re-advertisement goroutine each time (discutil.Advertise
runs until ctx.Done()), causing a goroutine leak; fix by invoking
c.advertise(ctx) exactly once from the maintenance context (e.g., start it when
maintenanceCtx is created) and change the ticker loop to only call peer
discovery (e.g., c.FindPeers or the method that refreshes discovery) on each
tick, or alternatively modify c.advertise to provide a single-shot advertise API
and call that from the ticker; locate and update the code paths referencing
c.advertise(ctx) and the ticker loop so no repeated long-lived advertise
goroutines are started.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d4f0858-895a-4ea8-aa2e-46955a48c244

📥 Commits

Reviewing files that changed from the base of the PR and between 32dcf35 and 2ff6cee.

📒 Files selected for processing (1)
  • pkg/p2p/client.go

Comment on lines +317 to +321
func (c *Client) startConnectionMaintenance() {
ctx, cancel := context.WithCancel(context.Background())
c.maintenanceCancel = cancel

c.maintenanceWg.Go(func() {


⚠️ Potential issue | 🟠 Major

Queue currently disconnected seed peers when maintenance starts.

This worker only reacts to future Disconnected events. A seed peer that never connected during startup, or dropped before the notifee was registered, never enters reconnectCh, and refreshPeerDiscovery only searches c.getNamespace(). Seed any currently disconnected c.seedPeers once before the select loop begins.


Comment on lines +388 to +389
_ = c.advertise(ctx)



⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For github.com/libp2p/go-libp2p v0.48.0, what does p2p/discovery/util.Advertise do? Does it start a background re-advertisement loop tied to ctx.Done(), or is it a single synchronous advertise call?

💡 Result:

In github.com/libp2p/go-libp2p/p2p/discovery/util v0.48.0, util.Advertise(ctx, a, ns, ...) starts a background goroutine that re-advertises in a loop until ctx.Done() is closed—it is not a single synchronous advertise call. (raw.githubusercontent.com)


Fix goroutine leak in advertise refresh loop.

Calling c.advertise(ctx) every 5 minutes at line 388 spawns a new background goroutine with each call. In libp2p v0.48.0, discutil.Advertise starts a re-advertisement loop that runs until ctx.Done() — it is not a single synchronous call. Stacking these goroutines will leak resources for the client's lifetime.

Start c.advertise(ctx) once from the maintenance context and let the ticker only refresh peer discovery via FindPeers, or switch to a single-shot advertise call that does not spawn a background loop.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/p2p/client.go (2)

317-320: ⚠️ Potential issue | 🟠 Major

Still unresolved: keep only one advertise worker alive.

refreshPeerDiscovery() runs on a ticker, but advertise() already starts a long-lived re-advertisement worker. Calling it again here stacks extra advertisers for the lifetime of the client. Start one advertiser from the maintenance context and let the ticker only refresh discovery. If maintenance owns the long-lived advertiser, remove the startup peerDiscovery() call too so there is still only one loop total.

For the go-libp2p version used by this repository, does `p2p/discovery/util.Advertise` start a background re-advertisement loop until `ctx.Done()`, or is it a one-shot advertise call?

Based on learnings: Applies to **/*.go : Be mindful of goroutine leaks in Go code.

Also applies to: 399-400

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 317 - 320, startConnectionMaintenance
currently starts a maintenance context but also leads to multiple concurrent
advertisers because refreshPeerDiscovery calls advertise repeatedly while
advertise itself spawns a long-lived re-advertisement worker; to fix, ensure
only one long-lived advertise worker is created from startConnectionMaintenance
using the maintenance ctx (call advertise(ctx) once there), remove any other
startup call to peerDiscovery() so you don't spawn a separate loop, and change
refreshPeerDiscovery to only refresh discovery state (not call advertise) so it
no longer creates additional advertisers; reference functions:
startConnectionMaintenance, refreshPeerDiscovery, advertise, peerDiscovery and
the maintenanceCancel/maintenance context.

321-324: ⚠️ Potential issue | 🟠 Major

Still unresolved: seed the reconnect queue with already-disconnected seed peers.

This worker still only reacts to future Disconnected callbacks. A configured seed peer that never connected during startup, or dropped before Notify() was registered, never reaches reconnectCh, so the reconnect path never runs for it.

🔧 Minimal fix
 	c.maintenanceWg.Go(func() {
+		for _, sp := range c.seedPeers {
+			if c.isConnected(sp.ID) {
+				continue
+			}
+			select {
+			case c.reconnectCh <- sp.ID:
+			default:
+			}
+		}
+
 		discoveryTicker := time.NewTicker(peerDiscoveryInterval)
 		defer discoveryTicker.Stop()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 321 - 324, The worker registered in
c.maintenanceWg.Go only enqueues peers on future Disconnected callbacks, so
configured seed peers that were never connected or disconnected before Notify
was registered are never added to c.reconnectCh; fix by seeding the reconnect
queue at startup: after creating discoveryTicker in the maintenance goroutine
(inside the same block started by maintenanceWg.Go), iterate configured seed
peers (the seeds list used by the peer discovery logic), check each peer's
current connectedness via the host/peerstore/Network API (e.g.,
Host.Network().Connectedness / Peerstore.Connectedness), and send any peers that
are not currently connected into c.reconnectCh (respecting non-blocking
semantics or select with ctx.Done() to avoid blocking). Ensure this happens once
before the ticker loop so the existing reconnect worker will process
already-disconnected seed peers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/p2p/client.go`:
- Around line 206-209: Close() currently cancels maintenanceCancel and waits
only for maintenanceWg but discovery and dial goroutines (spawned by the
maintenance worker) are not time-boxed or tracked, so add a child context with a
short timeout for the periodic discovery/refresh and for each dial/Connect call
(derive via maintenanceCtx := context.WithTimeout(maintenanceCancelCtx,
shortDuration)), ensure every spawned dial worker is registered with the same
maintenanceWg (or use an errgroup bound to maintenanceCtx) so they are waited
on, pass the child ctx into the dial/Connect/DHT calls so they can be cancelled,
and guarantee connectSem is released in a defer inside each worker even on
timeout/cancellation; also ensure maintenanceCancel cancels the parent ctx so
all child timeouts abort during Close().


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ea60430-76a6-4442-9628-2a12b7b139d4

📥 Commits

Reviewing files that changed from the base of the PR and between 2ff6cee and e276ea2.

📒 Files selected for processing (1)
  • pkg/p2p/client.go

Comment on lines +206 to +209
if c.maintenanceCancel != nil {
	c.maintenanceCancel()
}
c.maintenanceWg.Wait()

⚠️ Potential issue | 🟠 Major

Track and time-box the maintenance dials.

Close() only waits for the dispatcher goroutine. Lines 373-381 and 413-416 spawn dial workers outside maintenanceWg, while Line 385 runs discovery inline on the only maintenance worker. A slow discovery/dial can block reconnect handling, pin connectSem, or still be running when teardown closes the host and DHT. Give the refresh and each dial a short child timeout, and track the spawned workers in the same wait group/errgroup.

As per coding guidelines "Use context.Context for cancellation in Go" and "Be mindful of goroutine leaks in Go code".

Also applies to: 373-381, 384-385, 401-419

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 206 - 209, Close() currently cancels
maintenanceCancel and waits only for maintenanceWg but discovery and dial
goroutines (spawned by the maintenance worker) are not time-boxed or tracked, so
add a child context with a short timeout for the periodic discovery/refresh and
for each dial/Connect call (derive via maintenanceCtx :=
context.WithTimeout(maintenanceCancelCtx, shortDuration)), ensure every spawned
dial worker is registered with the same maintenanceWg (or use an errgroup bound
to maintenanceCtx) so they are waited on, pass the child ctx into the
dial/Connect/DHT calls so they can be cancelled, and guarantee connectSem is
released in a defer inside each worker even on timeout/cancellation; also ensure
maintenanceCancel cancels the parent ctx so all child timeouts abort during
Close().

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
pkg/p2p/client.go (2)

312-317: ⚠️ Potential issue | 🟠 Major

Dial workers are not tracked by maintenance waitgroup and are not time-boxed.

The goroutines spawned at Line 314 are outside maintenanceWg and use the parent maintenance context without per-dial timeout, so Close() can wait for the loop yet still race with in-flight dials. Track these workers in maintenanceWg and wrap each dial with a short context.WithTimeout.

🔧 Focused fix
 		select {
 		case c.connectSem <- struct{}{}:
-			go func(peer peer.AddrInfo) {
-				defer func() { <-c.connectSem }()
-				c.tryConnect(ctx, peer)
-			}(p)
+			peerInfo := p
+			c.maintenanceWg.Go(func() {
+				defer func() { <-c.connectSem }()
+				dialCtx, cancel := context.WithTimeout(ctx, 15*time.Second)
+				defer cancel()
+				c.tryConnect(dialCtx, peerInfo)
+			})
 		case <-ctx.Done():
 			return
 		}

Based on learnings "Applies to **/.go : Use context.Context for cancellation in Go" and "Applies to **/.go : Be mindful of goroutine leaks in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 312 - 317, The spawned dial goroutines
started when sending to connectSem are not tracked by maintenanceWg and use the
parent maintenance context without a per-dial timeout; modify the goroutine that
calls tryConnect so you increment maintenanceWg (maintenanceWg.Add(1)) before
launching, defer maintenanceWg.Done() inside the goroutine, and wrap the parent
maintenance context with a short timeout via context.WithTimeout (e.g., ctxDial,
cancel := context.WithTimeout(ctx, <reasonable-duration>); defer cancel()) and
pass ctxDial into tryConnect; keep the existing defer that releases connectSem
so the semaphore is still cleared even on timeout or cancellation.

300-301: ⚠️ Potential issue | 🟠 Major

c.advertise(ctx) in the ticker path can stack long-lived advertise loops.

Calling c.advertise(ctx) on every refresh tick (Line 300) is still risky if discutil.Advertise is loop-based for the lifetime of ctx; this can accumulate goroutines over time. Move advertise startup to a single call when maintenance starts, and keep the ticker for peer refresh only.

For github.com/libp2p/go-libp2p v0.48.0, does p2p/discovery/util.Advertise start a background re-advertisement loop until ctx.Done(), or is it a one-shot synchronous advertise call?

As per coding guidelines "Be mindful of goroutine leaks in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 300 - 301, The call to c.advertise(ctx)
inside the refresh ticker can spawn long-lived advertise loops repeatedly if
discutil.Advertise runs until ctx.Done(); remove the per-tick call and instead
start advertising once when maintenance begins: call c.advertise(ctx) (or spawn
a single goroutine that calls discutil.Advertise with the maintenance context)
in the maintenance startup path, keep the ticker handling only peer refresh
logic, and ensure the advertise goroutine is tied to the same cancellable
context (or returns on a done channel) so it is stopped when maintenance stops
to avoid goroutine leaks; update any start/stop or runMaintenance functions to
start that single advertise instance and to cancel it on shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/p2p/client.go`:
- Around line 312-317: The spawned dial goroutines started when sending to
connectSem are not tracked by maintenanceWg and use the parent maintenance
context without a per-dial timeout; modify the goroutine that calls tryConnect
so you increment maintenanceWg (maintenanceWg.Add(1)) before launching, defer
maintenanceWg.Done() inside the goroutine, and wrap the parent maintenance
context with a short timeout via context.WithTimeout (e.g., ctxDial, cancel :=
context.WithTimeout(ctx, <reasonable-duration>); defer cancel()) and pass
ctxDial into tryConnect; keep the existing defer that releases connectSem so the
semaphore is still cleared even on timeout or cancellation.
- Around line 300-301: The call to c.advertise(ctx) inside the refresh ticker
can spawn long-lived advertise loops repeatedly if discutil.Advertise runs until
ctx.Done(); remove the per-tick call and instead start advertising once when
maintenance begins: call c.advertise(ctx) (or spawn a single goroutine that
calls discutil.Advertise with the maintenance context) in the maintenance
startup path, keep the ticker handling only peer refresh logic, and ensure the
advertise goroutine is tied to the same cancellable context (or returns on a
done channel) so it is stopped when maintenance stops to avoid goroutine leaks;
update any start/stop or runMaintenance functions to start that single advertise
instance and to cancel it on shutdown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28a990a3-7f37-4d5c-9247-20556ba27af8

📥 Commits

Reviewing files that changed from the base of the PR and between e276ea2 and 9259f1a.

📒 Files selected for processing (1)
  • pkg/p2p/client.go

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/p2p/client.go (1)

190-191: Consider passing context parameter to satisfy linter, even if unused.

The golangci-lint contextcheck rule flags this because startConnectionMaintenance doesn't receive the context from startWithHost. While the design is correct (maintenance uses context.Background() to decouple from caller), you could either:

  1. Disable contextcheck for this line with //nolint:contextcheck
  2. Accept a context.Context parameter but explicitly ignore it, documenting why
Option 1: Disable linter for this line
 	c.host.Network().Notify(c.newDisconnectNotifee())
-	c.startConnectionMaintenance()
+	c.startConnectionMaintenance() //nolint:contextcheck // maintenance uses independent background context
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 190 - 191, The linter flags
startConnectionMaintenance for not accepting the caller context; update
startConnectionMaintenance to accept a context.Context parameter (e.g., func (c
*Client) startConnectionMaintenance(ctx context.Context)) and change the call in
startWithHost from c.startConnectionMaintenance() to
c.startConnectionMaintenance(ctx). Inside startConnectionMaintenance, explicitly
ignore the passed-in ctx (e.g., _ = ctx) and add a short comment explaining you
intentionally use context.Background() internally to decouple maintenance from
the caller; this satisfies contextcheck while preserving the original design.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/p2p/client.go`:
- Around line 308-325: The code calls maintenanceWg.Go(...) but maintenanceWg is
a sync.WaitGroup which has no Go method; replace that call with the manual
WaitGroup goroutine pattern in startConnectionMaintenance: call
maintenanceWg.Add(1) before launching a goroutine, then start a plain go func()
that defers maintenanceWg.Done() and contains the existing ticker loop/body
(including ctx, cancel, discoveryTicker). This preserves the existing behavior
without changing maintenanceWg's type; alternatively, if you prefer to keep the
Go() usage, change the maintenanceWg field to an errgroup.Group and update
callers, but the minimal fix is to use Add(1) + go func() + defer Done() around
the current closure.

---

Nitpick comments:
In `@pkg/p2p/client.go`:
- Around line 190-191: The linter flags startConnectionMaintenance for not
accepting the caller context; update startConnectionMaintenance to accept a
context.Context parameter (e.g., func (c *Client) startConnectionMaintenance(ctx
context.Context)) and change the call in startWithHost from
c.startConnectionMaintenance() to c.startConnectionMaintenance(ctx). Inside
startConnectionMaintenance, explicitly ignore the passed-in ctx (e.g., _ = ctx)
and add a short comment explaining you intentionally use context.Background()
internally to decouple maintenance from the caller; this satisfies contextcheck
while preserving the original design.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e5805969-0d39-43a6-8794-a9173351ab03

📥 Commits

Reviewing files that changed from the base of the PR and between 9259f1a and cd87d23.

📒 Files selected for processing (1)
  • pkg/p2p/client.go

Comment on lines +308 to +325
func (c *Client) startConnectionMaintenance() {
	ctx, cancel := context.WithCancel(context.Background())
	c.maintenanceCancel = cancel

	c.maintenanceWg.Go(func() {
		discoveryTicker := time.NewTicker(peerDiscoveryInterval)
		defer discoveryTicker.Stop()

		for {
			select {
			case <-ctx.Done():
				return
			case <-discoveryTicker.C:
				c.refreshPeerDiscovery(ctx)
			}
		}
	})
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify sync.WaitGroup usage and check if errgroup is used elsewhere
rg "sync\.WaitGroup" --type go -l | head -5
rg "errgroup\.Group" --type go -l | head -5

Repository: evstack/ev-node

Length of output: 235


🏁 Script executed:

#!/bin/bash
# Find the Client struct definition and maintenanceWg field type
ast-grep --pattern $'type Client struct {
  $$$
  maintenanceWg $_
  $$$
}'

Repository: evstack/ev-node

Length of output: 944


🏁 Script executed:

#!/bin/bash
# Check imports in pkg/p2p/client.go
head -30 pkg/p2p/client.go | cat -n

Repository: evstack/ev-node

Length of output: 1132


🏁 Script executed:

#!/bin/bash
# Broader search for maintenanceWg declaration
rg "maintenanceWg\s*[:\=]" --type go -A 1 -B 1 pkg/p2p/

Repository: evstack/ev-node

Length of output: 164


🏁 Script executed:

#!/bin/bash
# Double-check: search for any other imports or type definitions that might affect maintenanceWg
rg "maintenanceWg" pkg/p2p/client.go | head -10

Repository: evstack/ev-node

Length of output: 145


sync.WaitGroup has no Go() method before Go 1.25, so this code will not compile on older toolchains.

The maintenanceWg field is declared as sync.WaitGroup (line 70), which on pre-1.25 toolchains only provides Add(), Done(), and Wait(). Either use golang.org/x/sync/errgroup.Group (which provides a Go() method), require Go 1.25+, or use the manual pattern:

Fix using manual WaitGroup pattern
 func (c *Client) startConnectionMaintenance() {
 	ctx, cancel := context.WithCancel(context.Background())
 	c.maintenanceCancel = cancel

-	c.maintenanceWg.Go(func() {
+	c.maintenanceWg.Add(1)
+	go func() {
+		defer c.maintenanceWg.Done()
 		discoveryTicker := time.NewTicker(peerDiscoveryInterval)
 		defer discoveryTicker.Stop()

 		for {
 			select {
 			case <-ctx.Done():
 				return
 			case <-discoveryTicker.C:
 				c.refreshPeerDiscovery(ctx)
 			}
 		}
-	})
+	}()
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func (c *Client) startConnectionMaintenance() {
	ctx, cancel := context.WithCancel(context.Background())
	c.maintenanceCancel = cancel

	c.maintenanceWg.Go(func() {
		discoveryTicker := time.NewTicker(peerDiscoveryInterval)
		defer discoveryTicker.Stop()

		for {
			select {
			case <-ctx.Done():
				return
			case <-discoveryTicker.C:
				c.refreshPeerDiscovery(ctx)
			}
		}
	})
}

func (c *Client) startConnectionMaintenance() {
	ctx, cancel := context.WithCancel(context.Background())
	c.maintenanceCancel = cancel

	c.maintenanceWg.Add(1)
	go func() {
		defer c.maintenanceWg.Done()
		discoveryTicker := time.NewTicker(peerDiscoveryInterval)
		defer discoveryTicker.Stop()

		for {
			select {
			case <-ctx.Done():
				return
			case <-discoveryTicker.C:
				c.refreshPeerDiscovery(ctx)
			}
		}
	}()
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/p2p/client.go` around lines 308 - 325, The code calls
maintenanceWg.Go(...) but maintenanceWg is a sync.WaitGroup which has no Go
method; replace that call with the manual WaitGroup goroutine pattern in
startConnectionMaintenance: call maintenanceWg.Add(1) before launching a
goroutine, then start a plain go func() that defers maintenanceWg.Done() and
contains the existing ticker loop/body (including ctx, cancel, discoveryTicker).
This preserves the existing behavior without changing maintenanceWg's type;
alternatively, if you prefer to keep the Go() usage, change the maintenanceWg
field to an errgroup.Group and update callers, but the minimal fix is to use
Add(1) + go func() + defer Done() around the current closure.
